100 research outputs found

Development and evaluation of an open source software tool for deidentification of pathology reports

Author: AH Namini
Bruce A Beckwith
D Gupta
Frank Kuo
JJ Berman
L Sweeney
R Miller
Rajeshwarri Mahaadevan
RK Taira
SM Thomas
Ulysses J Balis
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Electronic medical records, including pathology reports, are often used for research purposes. Currently, there are few programs freely available to remove identifiers while leaving the remainder of the pathology report text intact. Our goal was to produce an open source, Health Insurance Portability and Accountability Act (HIPAA) compliant, deidentification tool tailored for pathology reports.
METHODS: We designed a three-step process for removing potential identifiers. The first step is to look for identifiers known to be associated with the patient, such as name, medical record number, and pathology accession number. Next, a series of pattern matches looks for predictable patterns likely to represent identifying data, such as dates, accession numbers, and addresses, as well as patient, institution, and physician names. Finally, individual words are compared with a database of proper names and geographic locations. Pathology reports from three institutions were used to design and test the algorithms. The software was improved iteratively on training sets until it exhibited good performance. 1800 new pathology reports were then processed. Each report was reviewed manually before and after deidentification to catalog all identifiers and note those that were not removed.
RESULTS: 1254 (69.7%) of 1800 pathology reports contained identifiers in the body of the report. 3439 (98.3%) of 3499 unique identifiers in the test set were removed. Only 19 HIPAA-specified identifiers (mainly consult accession numbers and misspelled names) were missed. Of 41 non-HIPAA identifiers missed, the majority were partial institutional addresses and ages. Outside consultation case reports typically contain numerous identifiers and were the most challenging to deidentify comprehensively. There was variation in performance among reports from the three institutions, highlighting the need for site-specific customization, which is easily accomplished with our tool.
CONCLUSION: We have demonstrated that it is possible to create an open-source deidentification program which performs well on free-text pathology reports.
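
The three-step process described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the published tool: the function name `deidentify`, the `[REMOVED]` placeholder, the two regular expressions, and the tiny name dictionary are all assumptions made for the example.

```python
import re

# Hypothetical placeholder; the abstract does not say what the tool substitutes.
PLACEHOLDER = "[REMOVED]"

# Step 2 patterns. These are illustrative guesses, not the tool's own expressions.
PATTERNS = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),   # dates such as 3/14/2005
    re.compile(r"\b[A-Z]{1,2}\d{2}-\d{3,6}\b"),   # accession numbers such as S05-12345
]


def deidentify(text, known_identifiers, name_dictionary):
    """Apply the three steps in order and return the scrubbed text."""
    # Step 1: remove identifiers already known to belong to the patient.
    for ident in sorted(known_identifiers, key=len, reverse=True):
        if ident:
            text = re.sub(re.escape(ident), PLACEHOLDER, text, flags=re.IGNORECASE)

    # Step 2: remove predictable patterns likely to be identifying data.
    for pattern in PATTERNS:
        text = pattern.sub(PLACEHOLDER, text)

    # Step 3: compare individual words with a list of proper names and places.
    def scrub(match):
        word = match.group(0)
        return PLACEHOLDER if word.lower() in name_dictionary else word

    return re.sub(r"[A-Za-z]+", scrub, text)


if __name__ == "__main__":
    report = "Patient John Smith, MRN 1234567, seen 3/14/2005. Case S05-12345 read by Dr. Jones in Boston."
    print(deidentify(report, ["John Smith", "1234567"], {"jones", "boston"}))
```

Running the steps in this order lets the cheap exact matches go first, so the broader pattern and dictionary passes only need to catch what is left. Site-specific customization, which the abstract highlights, would amount to editing the pattern list and dictionary in a sketch like this.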

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Automated Detection of Radiology Reports that Document Non-routine Communication of Critical or Significant Results

Author: A Agresti
CP Langlotz
Curtis P. Langlotz
DW Bates
FW Lancaster
G Hripcsak
H Singh
I Goldstein
JA Inglefinger
KJ Dreyer
L Berlin
P Lakhani
Paras Lakhani
R Farkas
RK Taira
T Imai
WW Chapman
Publication venue: Springer-Verlag
Publication date: 01/01/2009

The purpose of this investigation is to develop an automated method to accurately detect radiology reports that indicate non-routine communication of critical or significant results. Such a classification system would be valuable for performance monitoring and accreditation. Using a database of 2.3 million free-text radiology reports, a rule-based query algorithm was developed after analyzing hundreds of radiology reports that indicated communication of critical or significant results to a healthcare provider. This algorithm consisted of words and phrases used by radiologists to indicate such communications combined with specific handcrafted rules. This algorithm was iteratively refined and retested on hundreds of reports until the precision and recall did not significantly change between iterations. The algorithm was then validated on the entire database of 2.3 million reports, excluding those reports used during the testing and refinement process. Human review was used as the reference standard. The accuracy of this algorithm was determined using precision, recall, and F measure. Confidence intervals were calculated using the adjusted Wald method. The developed algorithm for detecting critical result communication has a precision of 97.0% (95% CI, 93.5–98.8%), recall 98.2% (95% CI, 93.4–100%), and F measure of 97.6% (β = 1). Our query algorithm is accurate for identifying radiology reports that contain non-routine communication of critical or significant results. This algorithm can be applied to a radiology reports database for quality control purposes and help satisfy accreditation requirements.
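
The evaluation measures named in this abstract can be illustrated with a short sketch. The F measure below is the standard weighted harmonic mean of precision and recall, and the interval is the adjusted Wald (Agresti-Coull) form. The function names are hypothetical, and the counts in the usage lines are invented, since the abstract does not report the underlying counts.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta = 1 weights them equally)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def adjusted_wald(successes, n, z=Z_95):
    """Adjusted Wald interval for a proportion, clipped to [0, 1]."""
    n_adj = n + z * z
    p_adj = (successes + z * z / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


if __name__ == "__main__":
    # Reported precision and recall reproduce the reported F measure of 97.6%.
    print(round(f_measure(0.970, 0.982), 3))
    # Hypothetical counts: 50 correct out of 100 reviewed reports.
    print(adjusted_wald(50, 100))
```

The adjusted interval adds z²/2 pseudo-successes and z²/2 pseudo-failures before applying the usual Wald formula, which behaves better than the plain Wald interval when the proportion is close to 0 or 1, as it is here.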

Crossref

Springer - Publisher Connector

PubMed Central

De-identification of primary care electronic medical records free-text data in Ontario, Canada

Author: B Wellner
BA Beckwith
C Grouin
Chiriac Mihai
D Gupta
DMTI Spatial Inc
G Szarvas
I Neamatullah
JJ Berman
Joel Martin
Julie Klein-Geltink
Karen Tu
L Sweeney
M Sokolova
O Uzuner
Report of the WHO Global Observatory for eHealth
RK Taira
S Velupillai
Scott's Directories
SM Thomas
T Mitiku
Tezeta F Mitiku
Publication venue: BioMed Central
Publication date: 01/01/2010

BACKGROUND: Electronic medical records (EMRs) represent a potentially rich source of health information for research, but the free text in EMRs often contains identifying information. While de-identification tools have been developed for free text, none have been developed or tested for the full range of primary care EMR data.
METHODS: We used deid, an open source de-identification program, and modified it for an Ontario context for use on primary care EMR data. We developed the modified program on a training set of 1000 free-text records from one group practice and then tested it on two validation sets: a random sample of 700 free-text EMR records from 17 different physicians from 7 different practices in 5 different cities, and 500 free-text records from a group practice in a different city than the group practice used for the training set. We measured the sensitivity/recall, precision, specificity, accuracy and F-measure of the modified tool against manually tagged free-text records to remove patient and physician names, locations, addresses, medical record, health card and telephone numbers.
RESULTS: On the training set, the modified program performed with a sensitivity of 88.3%, specificity of 91.4%, precision of 91.3%, accuracy of 89.9% and F-measure of 0.90. The validation sets had sensitivities of 86.7% and 80.2%, specificities of 91.4% and 87.7%, precisions of 91.1% and 87.4%, accuracies of 89.0% and 83.8% and F-measures of 0.89 and 0.84 for the first and second validation sets respectively.
CONCLUSION: The deid program can be modified to reasonably accurately de-identify free-text primary care EMR records while preserving clinical content.
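
The five measures reported in this abstract all derive from one confusion matrix. The sketch below shows the standard definitions, treating a correctly removed identifier as a true positive. The function names are hypothetical and the counts in the first usage line are invented, since the abstract reports only the resulting rates.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard rates from true/false positive and negative counts."""
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


def f_from_rates(precision, sensitivity):
    """F-measure computed directly from reported precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(confusion_metrics(tp=80, fp=10, tn=90, fn=20))
    # Reported rates for the training set and the two validation sets.
    for precision, sensitivity in [(0.913, 0.883), (0.911, 0.867), (0.874, 0.802)]:
        print(round(f_from_rates(precision, sensitivity), 2))
```

Recomputing the F-measure from each reported precision and sensitivity pair gives 0.90, 0.89 and 0.84, matching the values stated in the abstract.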

University of Toronto Research Repository

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Discerning Tumor Status from Unstructured MRI Reports—Completeness of Information in Existing Reports and Utility of Automated Natural Language Processing

Author: AB Miller
BI Reiner
BJ Thomas
Bradley J. Erickson
C Cortes
CL Sistrom
CP Langlotz
E Galanis
G Hripcsak
GB Melton
GK Savova
Guergana K. Savova
I McCowan
IA McCowan
JC Denny
Jiaping Zheng
JL Hobby
JS Elkins
KJ Dreyer
L Berlin
L Zhou
Lionel T. E. Cheng
NR Dunnick
P Therasse
PM Hickey
R Khorasani
RK Taira
S Pakhomov
SS Naik
Y Lin
Publication venue: Springer-Verlag
Publication date: 01/01/2009

Information in electronic medical records is often in an unstructured free-text format. This format presents challenges for expedient data retrieval and may fail to convey important findings. Natural language processing (NLP) is an emerging technique for rapid and efficient clinical data retrieval. While proven in disease detection, the utility of NLP in discerning disease progression from free-text reports is untested. We aimed to (1) assess whether unstructured radiology reports contained sufficient information for tumor status classification; (2) develop an NLP-based data extraction tool to determine tumor status from unstructured reports; and (3) compare NLP and human tumor status classification outcomes. Consecutive follow-up brain tumor magnetic resonance imaging reports (2000–2007) from a tertiary center were manually annotated using consensus guidelines on tumor status. Reports were randomized to NLP training (70%) or testing (30%) groups. The NLP tool utilized a support vector machines model with statistical and rule-based outcomes. Most reports had sufficient information for tumor status classification, although 0.8% did not describe status despite reference to prior examinations. Tumor size was unreported in 68.7% of documents, while 50.3% lacked data on change magnitude when there was detectable progression or regression. Using retrospective human classification as the gold standard, NLP achieved 80.6% sensitivity and 91.6% specificity for tumor status determination (mean positive predictive value, 82.4%; negative predictive value, 92.0%). In conclusion, most reports contained sufficient information for tumor status determination, though variable features were used to describe status. NLP demonstrated good accuracy for tumor status classification and may have novel application for automated disease status classification from electronic databases.
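
The 70%/30% randomization of reports into training and testing groups might look like the following sketch. The function name `split_reports`, the fixed seed, and the use of integer report IDs are assumptions for illustration. The abstract does not describe how the randomization was actually carried out.

```python
import random


def split_reports(report_ids, train_fraction=0.7, seed=0):
    """Shuffle report IDs reproducibly and split them into training and testing groups."""
    ids = list(report_ids)
    random.Random(seed).shuffle(ids)   # local generator, so global random state is untouched
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


if __name__ == "__main__":
    train, test = split_reports(range(100))
    print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so the testing group stays unseen across repeated training runs.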

Crossref

Springer - Publisher Connector

PubMed Central

Pattern-based information extraction from pathology reports for cancer registration

Author: A Breslow
A Coden
A Hotho
A Turchin
Colin Fox
David Connolly
DF Gleason
Giulio Napolitano
JEF Friedl
LH Sobin
N Collier
R Stevens
Richard Middleton
RK Taira
WHJ Clark
Publication venue: Springer Science and Business Media LLC

Crossref

Automatic de-identification of textual documents in the electronic health record: a review of recent research

Author: B Wellner
BA Beckwith
Brett R South
C Friedman
D Gupta
DA Dorr
E Aramaki
EM Fielstein
F Jeffrey Friedlin
FJ Friedlin
FP Morrison
G Szarvas
GPO U.S
H Cunningham
I Neamatullah
J Gardner
JJ Berman
K Atkinson
K Hara
L Sweeney
Matthew H Samore
NCI
NLM
O Uzuner
P Ruch
RK Taira
Shuying Shen
SM Meystre
SM Thomas
Stephane M Meystre
Y Guo
Publication venue: BioMed Central
Publication date: 01/01/2010

BACKGROUND: In the United States, the Health Insurance Portability and Accountability Act (HIPAA) protects the confidentiality of patient data and requires the informed consent of the patient and approval of the Internal Review Board to use data for research purposes, but these requirements can be waived if data is de-identified. For clinical data to be considered de-identified, the HIPAA "Safe Harbor" technique requires 18 data elements (called PHI: Protected Health Information) to be removed. The de-identification of narrative text documents is often realized manually, and requires significant resources. Well aware of these issues, several authors have investigated automated de-identification of narrative text documents from the electronic health record, and a review of recent research in this domain is presented here.
METHODS: This review focuses on recently published research (after 1995), and includes relevant publications from bibliographic queries in PubMed, conference proceedings, the ACM Digital Library, and interesting publications referenced in already included papers.
RESULTS: The literature search returned more than 200 publications. The majority focused only on structured data de-identification instead of narrative text, on image de-identification, or described manual de-identification, and were therefore excluded. Finally, 18 publications describing automated text de-identification were selected for detailed analysis of the architecture and methods used, the types of PHI detected and removed, the external resources used, and the types of clinical documents targeted. All text de-identification systems aimed to identify and remove person names, and many included other types of PHI. Most systems used only one or two specific clinical document types, and were mostly based on two different groups of methodologies: pattern matching and machine learning. Many systems combined both approaches for different types of PHI, but the majority relied only on pattern matching, rules, and dictionaries.
CONCLUSIONS: In general, methods based on dictionaries performed better with PHI that is rarely mentioned in clinical text, but are more difficult to generalize. Methods based on machine learning tend to perform better, especially with PHI that is not mentioned in the dictionaries used. Finally, the issues of anonymization, sufficient performance, and "over-scrubbing" are discussed in this publication.

Crossref

IUPUIScholarWorks

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

LC/MS-Based Quantitative Proteomic Analysis of Paraffin-Embedded Archival Melanomas Reveals Potential Proteomic Biomarkers Associated with Metastasis

BACKGROUND: Melanoma metastasis status is highly associated with the overall survival of patients; yet, little is known about proteomic changes during melanoma tumor progression. To better understand the changes in protein expression involved in melanoma progression and metastasis, and to identify potential biomarkers, we conducted a global quantitative proteomic analysis on archival metastatic and primary melanomas.
METHODOLOGY AND FINDINGS: A total of 16 metastatic and 8 primary cutaneous melanomas were assessed. Proteins were extracted from laser capture microdissected, formalin-fixed, paraffin-embedded archival tissues by liquefying tissue cells. These preparations were analyzed by an LC/MS-based label-free protein quantification method. More than 1500 proteins were identified in the tissue lysates with a peptide ID confidence level of >75%. This approach identified 120 significant changes in protein levels. These proteins were identified from multiple peptides with high confidence identification and were expressed at significantly different levels in metastases as compared with primary melanomas (q-value < 0.05).
CONCLUSIONS AND SIGNIFICANCE: The differentially expressed proteins were classified by biological process or mapped into biological system networks, and several proteins were implicated by these analyses as cancer- or metastasis-related. These proteins represent potential biomarkers for tumor progression. The study successfully identified proteins that are differentially expressed in formalin-fixed, paraffin-embedded specimens of metastatic and primary melanoma.
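
A q-value threshold such as the 0.05 cutoff above is a false discovery rate criterion applied across many simultaneous protein comparisons. The abstract does not say how its q-values were computed. The sketch below uses the Benjamini-Hochberg adjustment purely as one common way of obtaining them, and the function name and p-values are hypothetical.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


if __name__ == "__main__":
    # Hypothetical p-values for four proteins.
    p = [0.01, 0.04, 0.03, 0.5]
    q = bh_q_values(p)
    significant = [i for i, value in enumerate(q) if value < 0.05]
    print(q, significant)
```

With these invented inputs only the first protein passes the 0.05 cutoff, even though three raw p-values are below 0.05. This is the kind of correction that matters when more than 1500 proteins are compared at once.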

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central